{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Batch Reinforcement Learning with Coach"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In many real-world problems, a learning agent cannot interact with the real environment or with a simlulated one. This might be due to the risk of taking sub-optimal actions in the real world, or due to the complexity of creating a simluator that immitates correctly the real environment dynamics. In such cases, the learning agent is only exposed to data that was collected using some deployed policy, and we would like to use that data to learn a better policy for solving the problem. \n",
    "One such example might be developing a better drug dose or admission scheduling policy. We have data based on the policy that was used with patients so far, but cannot experiment (and explore) on patients to collect new data. \n",
    "\n",
    "But wait... If we don't have a simulator, how would we evaluate our newly learned policy and know if it is any good? Which algorithms should we be using in order to better address the problem of learning only from a batch of data? \n",
    "\n",
    "Alternatively, what do we do if we don't have a simulator, but instead we can actually deploy our policy on that real-world environment, and would just like to separate the new data collection part from the learning part (i.e. if we have a system that can quite easily run inference, but is very hard to integrate a reinforcement learning framework with, such as Coach, for learning a new policy).\n",
    "\n",
    "We will try to address these questions and more in this tutorial, demonstrating how to use [Batch Reinforcement Learning](http://tgabel.de/cms/fileadmin/user_upload/documents/Lange_Gabel_EtAl_RL-Book-12.pdf). \n",
    "\n",
    "First, let's use a simple environment to collect the data to be used for learning a policy using Batch RL. In reality, we probably would already have a dataset of transitions of the form `<current_observation, action, reward, next_state>` to be used for learning a new policy. Ideally, we would also have, for each transtion, $p(a|o)$ the probabilty of an action, given that transition's `current_observation`. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preliminaries\n",
    "First, get the required imports and other general settings we need for this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Solving Acrobot with Batch RL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from copy import deepcopy\n",
    "import tensorflow as tf\n",
    "import os\n",
    "\n",
    "from rl_coach.agents.dqn_agent import DQNAgentParameters\n",
    "from rl_coach.agents.ddqn_bcq_agent import DDQNBCQAgentParameters, KNNParameters\n",
    "from rl_coach.base_parameters import VisualizationParameters\n",
    "from rl_coach.core_types import TrainingSteps, EnvironmentEpisodes, EnvironmentSteps, CsvDataset\n",
    "from rl_coach.environments.gym_environment import GymVectorEnvironment\n",
    "from rl_coach.graph_managers.batch_rl_graph_manager import BatchRLGraphManager\n",
    "from rl_coach.graph_managers.graph_manager import ScheduleParameters\n",
    "from rl_coach.memories.memory import MemoryGranularity\n",
    "from rl_coach.schedules import LinearSchedule\n",
    "from rl_coach.memories.episodic import EpisodicExperienceReplayParameters\n",
    "from rl_coach.architectures.head_parameters import QHeadParameters\n",
    "from rl_coach.agents.ddqn_agent import DDQNAgentParameters\n",
    "from rl_coach.base_parameters import TaskParameters\n",
    "from rl_coach.spaces import SpacesDefinition, DiscreteActionSpace, VectorObservationSpace, StateSpace, RewardSpace\n",
    "\n",
    "# Get all the outputs of this tutorial out of the 'Resources' folder\n",
    "os.chdir('Resources')\n",
    "\n",
    "# the dataset size to collect \n",
    "DATASET_SIZE = 50000\n",
    "\n",
    "task_parameters = TaskParameters(experiment_path='.')\n",
    "\n",
    "####################\n",
    "# Graph Scheduling #\n",
    "####################\n",
    "\n",
    "schedule_params = ScheduleParameters()\n",
    "\n",
    "# 100 epochs (we run train over all the dataset, every epoch) of training\n",
    "schedule_params.improve_steps = TrainingSteps(100)\n",
    "\n",
    "# we evaluate the model every epoch\n",
    "schedule_params.steps_between_evaluation_periods = TrainingSteps(1)\n",
    "\n",
    "# only for when we have an enviroment\n",
    "schedule_params.evaluation_steps = EnvironmentEpisodes(10)\n",
    "schedule_params.heatup_steps = EnvironmentSteps(DATASET_SIZE)\n",
    "\n",
    "################\n",
    "#  Environment #\n",
    "################\n",
    "env_params = GymVectorEnvironment(level='Acrobot-v1')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use OpenAI Gym's `Acrobot-v1` in order to collect a dataset of experience, and then use that dataset in order to learn a policy solving the environment using Batch RL. \n",
    "\n",
    "### The Preset \n",
    "\n",
    "First we will collect a dataset using a random action selecting policy. Then we will use that dataset to train an agent in a Batch RL fashion. <br>\n",
    "Let's start simple - training an agent with Double DQN. \n",
    "\n",
    " "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tf.reset_default_graph() # just to clean things up; only needed for the tutorial\n",
    "\n",
    "#########\n",
    "# Agent #\n",
    "#########\n",
    "agent_params = DDQNAgentParameters()\n",
    "agent_params.network_wrappers['main'].batch_size = 128\n",
    "agent_params.algorithm.num_steps_between_copying_online_weights_to_target = TrainingSteps(100)\n",
    "agent_params.algorithm.discount = 0.99\n",
    "\n",
    "# to jump start the agent's q values, and speed things up, we'll initialize the last Dense layer's bias\n",
    "# with a number in the order of the discounted reward of a random policy\n",
    "agent_params.network_wrappers['main'].heads_parameters = \\\n",
    "[QHeadParameters(output_bias_initializer=tf.constant_initializer(-100))]\n",
    "\n",
    "# NN configuration\n",
    "agent_params.network_wrappers['main'].learning_rate = 0.0001\n",
    "agent_params.network_wrappers['main'].replace_mse_with_huber_loss = False\n",
    "\n",
    "# ER - we'll need an episodic replay buffer for off-policy evaluation\n",
    "agent_params.memory = EpisodicExperienceReplayParameters()\n",
    "\n",
    "# E-Greedy schedule - there is no exploration in Batch RL. Disabling E-Greedy. \n",
    "agent_params.exploration.epsilon_schedule = LinearSchedule(initial_value=0, final_value=0, decay_steps=1)\n",
    "agent_params.exploration.evaluation_epsilon = 0\n",
    "\n",
    "\n",
    "graph_manager = BatchRLGraphManager(agent_params=agent_params,\n",
    "                                    env_params=env_params,\n",
    "                                    schedule_params=schedule_params,\n",
    "                                    vis_params=VisualizationParameters(dump_signals_to_csv_every_x_episodes=1),\n",
    "                                    reward_model_num_epochs=30)\n",
    "graph_manager.create_graph(task_parameters)\n",
    "graph_manager.improve()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we see Coach running a long heatup of 50,000 steps (as we have defined a `DATASET_SIZE` of 50,000 in the preliminaries section), in order to collect a dataset of random actions. Then we can see Coach training a supervised reward model that is needed for the `Doubly Robust` OPE (off-policy evaluation). Last, Coach starts using the collected dataset of experience to train a Double DQN agent. Since, for this environment, we actually do have a simulator, Coach will be using it to evaluate the learned policy. As you can probably see, since this is a very simple environment, a dataset of just random actions is enough to get a Double DQN agent training, and reaching rewards of less than -100 (actually solving the environment). As you can also probably notice, the learning is not very stable, and if you take a look at the Q values predicted by the agent (e.g. in Coach Dashboard; this tutorial experiment results are under the `Resources` folder), you will see them increasing unboundedly. This is caused due to the Batch RL based learning, where not interacting with the environment any further, while randomly exposing only small parts of the MDP in the dataset, makes learning even harder than standard Off-Policy RL. This phenomena is very nicely explained in [Off-Policy Deep Reinforcement Learning without Exploration](https://arxiv.org/abs/1812.02900). We have implemented a discrete-actions variant of [Batch Constrained Q-Learning](https://github.com/NervanaSystems/coach/blob/master/rl_coach/agents/ddqn_bcq_agent.py), which helps mitigating this issue. "
   ]
  },
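  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To give a feel for how a batch-constrained variant can curb this over-estimation, here is a minimal NumPy sketch of a constrained Bellman target. It is an illustration under stated assumptions, not the code in `ddqn_bcq_agent.py`: the function `constrained_q_targets`, the relative `threshold` rule, and the toy numbers are all hypothetical, and for brevity it uses a plain Q-learning target rather than a Double DQN one. The idea is to take the max only over next-state actions that the behavior policy (as estimated from the dataset) was reasonably likely to take, so that Q values of actions the dataset never covers cannot leak into the targets. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "def constrained_q_targets(rewards, next_q_values, next_behavior_probs, game_overs,\n",
    "                          discount=0.99, threshold=0.3):\n",
    "    # hypothetical helper: Bellman targets that max only over sufficiently likely actions\n",
    "    next_q_values = np.asarray(next_q_values, dtype=float)\n",
    "    probs = np.asarray(next_behavior_probs, dtype=float)\n",
    "    # an action is allowed if its estimated behavior probability is at least\n",
    "    # `threshold` times the probability of the most likely action in that state\n",
    "    allowed = probs / probs.max(axis=1, keepdims=True) >= threshold\n",
    "    # disallowed actions get -inf so they can never win the max\n",
    "    best_next_q = np.where(allowed, next_q_values, -np.inf).max(axis=1)\n",
    "    # no bootstrapping from terminal transitions\n",
    "    not_done = 1.0 - np.asarray(game_overs, dtype=float)\n",
    "    return np.asarray(rewards, dtype=float) + discount * not_done * best_next_q\n",
    "\n",
    "\n",
    "# toy batch of two transitions with three actions each (made-up numbers)\n",
    "toy_targets = constrained_q_targets(\n",
    "    rewards=[-1.0, -1.0],\n",
    "    next_q_values=[[-50.0, -10.0, -40.0], [-30.0, -20.0, -25.0]],\n",
    "    next_behavior_probs=[[0.6, 0.05, 0.35], [0.3, 0.4, 0.3]],\n",
    "    game_overs=[0, 1])\n",
    "print(toy_targets)"
   ]
  },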
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's switch to a dataset containing data combined from several 'deployed' policies, as is often the case in real-world scenarios, where we already have a policy (hopefully not a random one) in-place and we want to improve it. For instance, a recommender system already using a policy for generating recommendations, and we want to use Batch RL to learn a better policy. <br>\n",
    "\n",
    "We will demonstrate that by training an agent, and using its replay buffer content as the dataset from which we will learn a new policy, without any further interaction with the environment. This should allow for both a better trained agent and for more meaningful Off-Policy Evaluation (as the more extensive your input data is, i.e. exposing more of the MDP, the better the evaluation of a new policy based on it)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tf.reset_default_graph() # just to clean things up; only needed for the tutorial\n",
    "\n",
    "# Experience Generating Agent parameters\n",
    "experience_generating_agent_params = DDQNAgentParameters()\n",
    "\n",
    "# schedule parameters\n",
    "experience_generating_schedule_params = ScheduleParameters()\n",
    "experience_generating_schedule_params.heatup_steps = EnvironmentSteps(1000)\n",
    "experience_generating_schedule_params.improve_steps = TrainingSteps(\n",
    "    DATASET_SIZE - experience_generating_schedule_params.heatup_steps.num_steps)\n",
    "experience_generating_schedule_params.steps_between_evaluation_periods = EnvironmentEpisodes(10)\n",
    "experience_generating_schedule_params.evaluation_steps = EnvironmentEpisodes(1)\n",
    "\n",
    "# DQN params\n",
    "experience_generating_agent_params.algorithm.num_steps_between_copying_online_weights_to_target = EnvironmentSteps(100)\n",
    "experience_generating_agent_params.algorithm.discount = 0.99\n",
    "experience_generating_agent_params.algorithm.num_consecutive_playing_steps = EnvironmentSteps(1)\n",
    "\n",
    "# NN configuration\n",
    "experience_generating_agent_params.network_wrappers['main'].learning_rate = 0.0001\n",
    "experience_generating_agent_params.network_wrappers['main'].batch_size = 128\n",
    "experience_generating_agent_params.network_wrappers['main'].replace_mse_with_huber_loss = False\n",
    "experience_generating_agent_params.network_wrappers['main'].heads_parameters = \\\n",
    "[QHeadParameters(output_bias_initializer=tf.constant_initializer(-100))]\n",
    "\n",
    "# ER size\n",
    "experience_generating_agent_params.memory = EpisodicExperienceReplayParameters()\n",
    "experience_generating_agent_params.memory.max_size = \\\n",
    "    (MemoryGranularity.Transitions,\n",
    "     experience_generating_schedule_params.heatup_steps.num_steps +\n",
    "     experience_generating_schedule_params.improve_steps.num_steps)\n",
    "\n",
    "# E-Greedy schedule\n",
    "experience_generating_agent_params.exploration.epsilon_schedule = LinearSchedule(1.0, 0.01, DATASET_SIZE)\n",
    "experience_generating_agent_params.exploration.evaluation_epsilon = 0\n",
    "\n",
    "# 50 epochs of training (the entire dataset is used each epoch)\n",
    "schedule_params.improve_steps = TrainingSteps(50)\n",
    "\n",
    "graph_manager = BatchRLGraphManager(agent_params=agent_params,\n",
    "                                    experience_generating_agent_params=experience_generating_agent_params,\n",
    "                                    experience_generating_schedule_params=experience_generating_schedule_params,\n",
    "                                    env_params=env_params,\n",
    "                                    schedule_params=schedule_params,\n",
    "                                    vis_params=VisualizationParameters(dump_signals_to_csv_every_x_episodes=1),\n",
    "                                    reward_model_num_epochs=30,\n",
    "                                    train_to_eval_ratio=0.5)\n",
    "graph_manager.create_graph(task_parameters)\n",
    "graph_manager.improve()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Off-Policy Evaluation\n",
    "As we mentioned earlier, one of the hardest problems in Batch RL is that we do not have a simulator or cannot easily deploy a trained policy on the real-world environment, in order to test its goodness. This is where OPE comes in handy. \n",
    "\n",
    "Coach supports several off-policy evaluators, some are useful for bandits problems (only evaluating a single step return), and others are for full-blown Reinforcement Learning problems. The main goal of the OPEs is to help us select the best model, either for collecting more data to do another round of Batch RL on, or for actual deployment in the real-world environment. \n",
    "\n",
    "Opening the experiment that we have just ran (under the `tutorials/Resources` folder, with Coach Dashboard), you will be able to plot the actual simulator's `Evaluation Reward`. Usually, we won't have this signal available as we won't have a simulator, but since we're using a dummy environment for demonstration purposes, we can take a look and examine how the OPEs correlate with it. \n",
    "\n",
    "Here are two example plots from Dashboard showing how well the `Weighted Importance Sampling` (RL estimator) and the `Doubly Robust` (bandits estimator) each correlate with the `Evaluation Reward`. \n",
    "![Weighted Importance Sampling](Resources/img/wis.png \"Weighted Importance Sampling vs. Evaluation Reward\") \n",
    "\n",
    "![Doubly Robust](Resources/img/dr.png \"Doubly Robust vs. Evaluation Reward\") \n",
    "\n"
   ]
  },
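  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the idea behind importance-sampling estimators concrete, here is a minimal NumPy sketch of ordinary and weighted importance sampling over whole episodes. It is not Coach's implementation: the function name `importance_sampling_estimates` and the toy numbers are hypothetical, and it assumes that for every logged step we know the probability of the logged action under both the behavior policy and the policy being evaluated. Each episode's discounted return is weighted by the product of its per-step probability ratios; the weighted variant normalizes by the sum of those weights instead of the number of episodes, which typically trades a little bias for lower variance. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "def importance_sampling_estimates(episodes, discount=0.99):\n",
    "    # episodes: list of (rewards, behavior_probs, target_probs) tuples, one per logged episode,\n",
    "    # where each element holds one value per step (hypothetical input layout)\n",
    "    weights, returns = [], []\n",
    "    for rewards, behavior_probs, target_probs in episodes:\n",
    "        rewards = np.asarray(rewards, dtype=float)\n",
    "        ratios = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)\n",
    "        # episode weight: product of the per-step probability ratios\n",
    "        weights.append(np.prod(ratios))\n",
    "        # discounted return actually observed in the logged episode\n",
    "        returns.append(np.sum(discount ** np.arange(len(rewards)) * rewards))\n",
    "    weights, returns = np.array(weights), np.array(returns)\n",
    "    ordinary = np.mean(weights * returns)\n",
    "    # assumes at least one episode has a non-zero weight\n",
    "    weighted = np.sum(weights * returns) / np.sum(weights)\n",
    "    return ordinary, weighted\n",
    "\n",
    "\n",
    "# two made-up episodes: the evaluated policy favors the actions of the first one\n",
    "toy_episodes = [\n",
    "    ([-1.0, -1.0, 0.0], [0.5, 0.5, 0.5], [0.9, 0.9, 0.9]),\n",
    "    ([-1.0, -1.0, -1.0], [0.5, 0.5, 0.5], [0.1, 0.1, 0.1]),\n",
    "]\n",
    "ordinary_is, weighted_is = importance_sampling_estimates(toy_episodes)\n",
    "print(ordinary_is, weighted_is)"
   ]
  },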
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using a Dataset to Feed a Batch RL Algorithm "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok, so we now understand how things are expected to work. But, hey... if we don't have a simulator (which we did have in this tutorial so far, and have used it to generate a training/evaluation dataset) how will we feed Coach with the dataset to train/evaluate on?\n",
    "\n",
    "### The CSV\n",
    "Coach defines a csv data format that can be used to fill its replay buffer. We have created an example csv from the same `Acrobot-v1` environment, and have placed it under the [Tutorials' Resources folder](https://github.com/NervanaSystems/coach/tree/master/tutorials/Resources).\n",
    "\n",
    "Here are the first couple of lines from it so you can get a grip of what to expect - "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "| action | all_action_probabilities | episode_id | episode_name | reward | transition_number | state_feature_0 | state_feature_1 | state_feature_2 | state_feature_3 | state_feature_4 | state_feature_5 \n",
    "|---|---|---|---|---|---|---|---|---|---|---|---------------------------------------------------------------------------|\n",
    "|0|[0.4159157,0.23191088,0.35217342]|0|acrobot|-1|0|0.996893843|0.078757007|0.997566524|0.069721088|-0.078539907|-0.072449002 |\n",
    "|1|[0.46244532,0.22402011,0.31353462]|0|acrobot|-1|1|0.997643051|0.068617369|0.999777604|0.021088905|-0.022653483|-0.40743716|\n",
    "|0|[0.4961428,0.21575058,0.2881066]|0|acrobot|-1|2|0.997613067|0.069051922|0.996147629|-0.087692077|0.023128103|-0.662019594|\n",
    "|2|[0.49341106,0.22363988,0.28294897]|0|acrobot|-1|3|0.997141344|0.075558854|0.972780655|-0.231727853|0.035575821|-0.771402023|\n"
   ]
  },
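  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A file in this layout can be inspected with nothing but the standard library. The sketch below is not Coach's loader (that is what `CsvDataset` is for); it only illustrates how the columns fit together. It assumes that `all_action_probabilities` is stored as a bracketed, comma-separated list inside a single quoted field, writes two shortened rows (only two state features, for brevity) to an in-memory buffer, and reads them back, extracting the probability that the behavior policy assigned to the logged action. The `transitions` list and its keys are hypothetical names. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import io\n",
    "import json\n",
    "\n",
    "header = ['action', 'all_action_probabilities', 'episode_id', 'episode_name', 'reward',\n",
    "          'transition_number', 'state_feature_0', 'state_feature_1']\n",
    "rows = [\n",
    "    [0, '[0.4159157,0.23191088,0.35217342]', 0, 'acrobot', -1, 0, 0.996893843, 0.078757007],\n",
    "    [1, '[0.46244532,0.22402011,0.31353462]', 0, 'acrobot', -1, 1, 0.997643051, 0.068617369],\n",
    "]\n",
    "\n",
    "# write the rows to an in-memory file; csv.writer quotes the probabilities field for us\n",
    "buffer = io.StringIO()\n",
    "writer = csv.writer(buffer)\n",
    "writer.writerow(header)\n",
    "writer.writerows(rows)\n",
    "buffer.seek(0)\n",
    "\n",
    "transitions = []\n",
    "for row in csv.DictReader(buffer):\n",
    "    action = int(row['action'])\n",
    "    # the bracketed list happens to be valid JSON, so json.loads can parse it\n",
    "    action_probs = json.loads(row['all_action_probabilities'])\n",
    "    state = [float(value) for name, value in row.items() if name.startswith('state_feature_')]\n",
    "    transitions.append({\n",
    "        'episode_id': int(row['episode_id']),\n",
    "        'transition_number': int(row['transition_number']),\n",
    "        'state': state,\n",
    "        'action': action,\n",
    "        'reward': float(row['reward']),\n",
    "        # probability the behavior policy assigned to the action it actually took\n",
    "        'behavior_prob': action_probs[action],\n",
    "    })\n",
    "\n",
    "print(transitions[0])"
   ]
  },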
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The Preset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tf.reset_default_graph() # just to clean things up; only needed for the tutorial\n",
    "\n",
    "#########\n",
    "# Agent #\n",
    "#########\n",
    "# note that we have moved to BCQ, which will help the training to converge better and faster\n",
    "agent_params = DDQNBCQAgentParameters() \n",
    "agent_params.network_wrappers['main'].batch_size = 128\n",
    "agent_params.algorithm.num_steps_between_copying_online_weights_to_target = TrainingSteps(100)\n",
    "agent_params.algorithm.discount = 0.99\n",
    "\n",
    "# to jump start the agent's q values, and speed things up, we'll initialize the last Dense layer\n",
    "# with something in the order of the discounted reward of a random policy\n",
    "agent_params.network_wrappers['main'].heads_parameters = \\\n",
    "[QHeadParameters(output_bias_initializer=tf.constant_initializer(-100))]\n",
    "\n",
    "# NN configuration\n",
    "agent_params.network_wrappers['main'].learning_rate = 0.0001\n",
    "agent_params.network_wrappers['main'].replace_mse_with_huber_loss = False\n",
    "\n",
    "# ER - we'll be needing an episodic replay buffer for off-policy evaluation\n",
    "agent_params.memory = EpisodicExperienceReplayParameters()\n",
    "\n",
    "# E-Greedy schedule - there is no exploration in Batch RL. Disabling E-Greedy. \n",
    "agent_params.exploration.epsilon_schedule = LinearSchedule(initial_value=0, final_value=0, decay_steps=1)\n",
    "agent_params.exploration.evaluation_epsilon = 0\n",
    "\n",
    "# can use either a kNN or a NN based model for predicting which actions not to max over in the bellman equation\n",
    "agent_params.algorithm.action_drop_method_parameters = KNNParameters()\n",
    "\n",
    "\n",
    "DATATSET_PATH = 'acrobot_dataset.csv'\n",
    "agent_params.memory = EpisodicExperienceReplayParameters()\n",
    "agent_params.memory.load_memory_from_file_path = CsvDataset(DATATSET_PATH, is_episodic = True)\n",
    "\n",
    "spaces = SpacesDefinition(state=StateSpace({'observation': VectorObservationSpace(shape=6)}),\n",
    "                          goal=None,\n",
    "                          action=DiscreteActionSpace(3),\n",
    "                          reward=RewardSpace(1))\n",
    "\n",
    "graph_manager = BatchRLGraphManager(agent_params=agent_params,\n",
    "                                    env_params=None,\n",
    "                                    spaces_definition=spaces,\n",
    "                                    schedule_params=schedule_params,\n",
    "                                    vis_params=VisualizationParameters(dump_signals_to_csv_every_x_episodes=1),\n",
    "                                    reward_model_num_epochs=30,\n",
    "                                    train_to_eval_ratio=0.4)\n",
    "graph_manager.create_graph(task_parameters)\n",
    "graph_manager.improve()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model Selection with OPE\n",
    "Running the above preset will train an agent based on the experience in the csv dataset. Note that now we are finally demonstarting the real scenario with Batch Reinforcement Learning, where we train and evaluate solely based on the recorded dataset. Coach uses the same dataset (after internally splitting it, obviously) for both training and evaluation. \n",
    "\n",
    "Now that we have ran this preset, we have 100 agents (one is saved after every training epoch), and we would have to decide which one we choose for deployment (either for running another round of experience collection and training, or for final deployment, meaning going into production).  \n",
    "\n",
    "Opening the experiment csv in Dashboard and displaying the OPE signals, we can now choose a checkpoint file for deployment on the end-node. Here is an example run, where we show the `Weighted Importance Sampling` and `Sequential Doubly Robust` OPEs. \n",
    "\n",
    "![Model Selection](Resources/img/model_selection.png \"Model Selection using OPE\") \n",
    "\n",
    "Based on this plot we would probably have chosen a checkpoint from around Epoch 85. From here, if we are not satisfied with the deployed agent's performance, we can iteratively continue with data collection, policy training (maybe based on a combination of all the data collected so far), and deployment. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
